Benchmark + in-place pipeline: SingleRust vs scanpy (PCA ~6×, ~3× overall, + scaling curve) #1
Open
iandriver wants to merge 3 commits into
Adds a focused demonstration of SingleRust's performance and its in-place processing ergonomics.

In-place example (examples/inplace_pipeline.rs): load an AnnData once and run the whole pipeline against the same object: qc_metrics, normalize, log1p, HVG, and run_pca_inplace each mutate adata in place (obs/var, X, obsm["X_pca"], uns). `adata` isn't even `mut`: interior mutability lets every step take a shared &adata, so there is no per-step copy of the matrix. That allocation-avoidance is what the benchmark quantifies.

Benchmark (demo/scverse_benchmark.ipynb + examples/bench_step.rs): runs each step one at a time, SingleRust vs the equivalent scanpy function, on a ~50k-cell CZI CELLxGENE slice, reporting compute time only (Rust .h5ad read/write measured separately and excluded). scanpy produces the state before each step and Rust runs that same step on it, so it's apples-to-apples.

Indicative (18 cores, 50k × 35.5k): qc 4.3×, normalize 7.8×, hvg 7.0×, pca 6.5×, ora 3.2×; overall ~4.9× (13.8s → 2.8s).

Supporting scaffolding: demo/prepare_data.py (parametrized Census fetch), markers.tsv, requirements.txt, README, .gitignore for .venv/data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- examples/bench_step.rs: add an `all` step that runs the full core pipeline (qc→normalize→log1p→hvg→pca) in one process, printing per-step and total compute time; used by the scaling sweep.
- demo/scverse_scaling.ipynb (+ _build_scaling_notebook.py): sweeps cell count 3k→50k, runs the full pipeline with SingleRust and scanpy at each size on the same raw subsample, and plots runtime-vs-cells (log–log) + speedup-vs-cells + per-step scaling.
- Add an untimed warm-up to BOTH notebooks so timings aren't charged for scanpy's one-time numba JIT / thread-pool startup. This corrects the earlier per-step numbers, which were warm-up-inflated: real result is ~3.2× overall (PCA ~6×, normalize ~8×), with QC actually ~0.75× (scanpy's optimized C path wins); flagged honestly in the README.
- Scaling: SingleRust faster at every size (3–9×); margin largest at moderate sizes, narrowing by 50k as both become compute-bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a large-scale scaling study answering "does the advantage hold at 500k, and is memory a confound?"

- prepare_data.py: --all-assays flag (the narrow 10x-3'-v3 filter caps at 277k cells; broadened blood/primary/normal has 4M, enabling a 500k base).
- demo/_scanpy_pipeline.py: standalone scanpy pipeline runner with an internal warm-up, so scanpy can be measured as a subprocess too.
- demo/scverse_scaling_large.ipynb (+ builder): sweeps 25k→500k, runs both tools under /usr/bin/time -l to capture compute time AND peak RSS on identical input; clean compute-bound sweep ≤200k, large sizes measured in isolation; plots runtime/speedup/memory vs cells.

Findings (48 GB / 18-core, clean-state reference):

- Scaling holds (SingleRust faster at every size), but the pure-compute speedup CONVERGES from ~5.8× (25k) to ~1.6× (500k): the small-N margins are low fixed overhead; at scale both are compute-bound (PCA dominates), gap ~1.6× (normalize/HVG/PCA 2–4×, QC reaches parity).
- Memory IS a confound, and favors SingleRust: peak RSS ~2× smaller (≈10 vs ≈18 GB at 500k), so it stays compute-bound while scanpy pages.

Two large-N confounds are documented: memory pressure (inflates scanpy 2–4× near the RAM limit, RSS plateaus) and thermal throttling under sustained benchmarking (slows both). RSS is immune to both and is the robust signal.

requirements.txt: add psutil.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
A focused demonstration of SingleRust's performance and its in-place processing
ergonomics:
- examples/inplace_pipeline.rs: a minimal, annotated pipeline that loads an AnnData once and mutates it through every step (no copies).
- demo/scverse_benchmark.ipynb + examples/bench_step.rs: a per-step runtime benchmark vs the equivalent scanpy function on a ~50k-cell CZI CELLxGENE dataset.
- demo/scverse_scaling.ipynb: a cells-vs-runtime scaling sweep (3k → 50k).

Stacked PR: base is integration/features-only (write_h5ad, run_pca_inplace, dense normalize/log1p, ORA, QC fix), so the diff here is just the benchmark, scaling, the in-place example, and supporting scaffolding.
Why it's fast: in-place operations
Interior mutability means you pass a shared &adata to each step and results accumulate on the same object: there is no copy between steps, and the expression matrix is never reallocated.
Benchmark (per step, 50k × 35.5k, 18 cores)
Compute time only (Rust .h5ad read/write excluded); a warm-up pass runs first so scanpy isn't charged for one-time numba JIT / thread-pool startup.
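The warm-up idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's actual harness: the helper name `time_after_warmup`, its `warmup`/`repeats` parameters, and the best-of-N policy are all assumptions made for the example.

```python
import time


def time_after_warmup(fn, warmup=1, repeats=3):
    """Return the best wall-clock time (seconds) of fn() after untimed warm-up calls.

    Hypothetical helper: the warm-up calls absorb one-time costs (JIT
    compilation, thread-pool startup, lazy imports) so they are not
    charged to the timed runs.
    """
    for _ in range(warmup):
        fn()  # untimed: pays any one-time startup cost

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # Best-of-N is an assumption of this sketch; a mean or median works too.
    return min(timings)


if __name__ == "__main__":
    def toy_step():
        # Stand-in for a pipeline step; not a real scanpy or SingleRust call.
        sum(i * i for i in range(100_000))

    print(f"toy step: {time_after_warmup(toy_step):.4f} s")
```

Applying the same warm-up to both tools keeps the comparison symmetric, even when only one of them has a noticeable startup cost.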
SingleRust wins decisively on the heavy/vectorizable steps (PCA ~6×, normalize ~8×). QC is the honest exception: scanpy's optimized C path slightly beats SingleRust's qc_metrics (the top-N segment proportions are the cost).

* ORA has no scanpy equivalent; compared against a NumPy/SciPy implementation of the same algorithm.
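For readers checking the ratios, here is a small sketch of how a speedup figure is derived, assuming the convention "scanpy time divided by SingleRust time" (so a value below 1 means scanpy is faster). The function names are hypothetical. The 13.8 s → 2.8 s pair is the overall figure quoted in the first commit message, which the later warm-up fix revised; the per-step times in the second example are made-up placeholders, not measurements.

```python
def speedup(baseline_s, candidate_s):
    """Speedup of the candidate over the baseline; > 1 means the candidate is faster."""
    if candidate_s <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_s / candidate_s


def overall_speedup(step_times):
    """Overall speedup from {step: (baseline_s, candidate_s)}.

    The overall figure is the ratio of total times, not the mean of the
    per-step ratios, so slow steps dominate it.
    """
    total_baseline = sum(b for b, _ in step_times.values())
    total_candidate = sum(c for _, c in step_times.values())
    return speedup(total_baseline, total_candidate)


if __name__ == "__main__":
    # Overall figure from the first commit message (later revised).
    print(f"{speedup(13.8, 2.8):.1f}x")

    # Hypothetical per-step times in seconds, for illustration only.
    toy = {"qc": (0.3, 0.4), "pca": (6.0, 1.0)}
    for step, (b, c) in toy.items():
        print(step, f"{speedup(b, c):.2f}x")
    print("overall", f"{overall_speedup(toy):.2f}x")
```

In the toy data a step that loses (ratio below 1) barely moves the overall number because it contributes little total time, which is the same pattern as QC in the results above.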
Scaling (runtime vs cells)
Faster at every size (3–9×). SingleRust scales ~linearly with non-zeros; scanpy carries higher
fixed per-step overhead, so the margin is largest at moderate sizes and narrows by 50k as both
become compute-bound.
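One way to see why a fixed-overhead difference produces a narrowing margin is a toy cost model in which each tool's runtime is `fixed + per_cell * n_cells`. The sketch below is an assumption-laden illustration: the model form and every coefficient are hypothetical, chosen only to show the shape, and are not fitted to the measurements in this PR. It also captures only the narrowing at large N, not the observation that the margin peaks at moderate sizes.

```python
def modelled_speedup(n_cells, base_fixed, base_per_cell, cand_fixed, cand_per_cell):
    """Speedup under a linear cost model: time = fixed + per_cell * n_cells.

    As n_cells grows, the fixed terms stop mattering and the speedup
    tends to base_per_cell / cand_per_cell.
    """
    base_time = base_fixed + base_per_cell * n_cells
    cand_time = cand_fixed + cand_per_cell * n_cells
    return base_time / cand_time


if __name__ == "__main__":
    # Hypothetical coefficients (seconds, seconds per cell), illustration only.
    coeffs = dict(base_fixed=1.0, base_per_cell=4e-5,
                  cand_fixed=0.1, cand_per_cell=2.5e-5)
    for n in (3_000, 50_000, 500_000):
        print(n, f"{modelled_speedup(n, **coeffs):.2f}x")
```

With these placeholder values the ratio falls steadily as the cell count grows and levels off at the ratio of the per-cell costs, which is the qualitative behaviour described above.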
Notes
.h5ad; reading from Python needs `import hdf5plugin`.

🤖 Generated with Claude Code
Scaling to 500k cells + memory (added)
demo/scverse_scaling_large.ipynb pushes to 500,000 cells (broadened all-assays blood query, genes fixed across sizes) and tracks peak RSS alongside time; both tools run under /usr/bin/time -l on identical input. Clean-state reference (48 GB / 18-core, large sizes measured in isolation):
- Scaling holds: SingleRust is faster at every size, but the margin narrows from ~5.8× (25k) to ~1.6× (500k). Small-N wins are low fixed overhead; at scale both are compute-bound (PCA dominates), gap ~1.6× (normalize/HVG/PCA 2–4×, QC at parity).
- Memory: SingleRust's peak RSS is ~2× smaller (≈10 vs ≈18 GB at 500k), so it stays compute-bound where scanpy starts paging. The notebook explicitly flags
the two large-N confounds on a laptop — memory pressure (inflates scanpy 2–4× near the RAM
limit while RSS plateaus) and thermal throttling under sustained benchmarking (slows both). RSS
is immune to both, so the lower-footprint result is the robust takeaway.
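For reference, a subprocess's peak RSS can also be read from Python's standard library rather than by parsing `/usr/bin/time -l` output. The sketch below is an assumption, not the notebook's code: `run_and_measure` is a hypothetical helper, it records wall time of the whole child process (startup and I/O included, not compute time only), and it relies on the Unix-only `resource` module.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd to completion; return (wall_seconds, peak_rss_bytes).

    Caveat: RUSAGE_CHILDREN reports the largest peak among all child
    processes this process has waited for so far, so for an exact
    per-command figure use one measurement per harness process.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start

    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return elapsed, peak * scale


if __name__ == "__main__":
    # Toy child process that builds a list; stands in for a pipeline run.
    child = [sys.executable, "-c", "x = list(range(2_000_000))"]
    seconds, peak_bytes = run_and_measure(child)
    print(f"{seconds:.2f} s, peak RSS {peak_bytes / 1e6:.0f} MB")
```

Because the peak is a high-water mark recorded by the kernel, it does not change when a run is slowed by throttling, which fits the point above about RSS being the more robust signal.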